9  Tidy data

Learning Objectives

After completing this lab you should be able to

For each of our modules we will have a project-folder with an Rproject, *.qmd-files, and sub-directories for data, scripts, and results as described in our Rproject Tutorial. You should have a directory on your Desktop or Documents folder on your laptop (name it something like bi349) as a home directory for all of our project folders this semester.

You should have already downloaded the project directory for this module, make sure the directory is unzipped and move it to your bi328 directory. You can open the Rproj for this module either by double clicking on it which will launch Rstudio or by opening Rstudio and then using File > Open Project or by clicking on the Rproject icon in the top right of your program window and selecting Open Project.

Once you have opened a project you should see the project name in the top right corner1.

  • 1 Pro tip: If you run into issues where a quarto document won’t render or file paths aren’t working (especially if things were working previously) one of your first steps should be to double check that the correct Rproj is loaded.

  • There should be a document named 09_tidy-data.qmd in your project directory. Use that file to work through this tutorial - you will hand in your rendered (“knitted”) quarto file as your homework assignment. So, first thing in the YAML header, change the author to your name. You will use this quarto document to record your answers. Remember to use comments to annotate your code; at minimum you should have one comment per code set2 you may of course add as many comments as you need to be able to recall what you did]. Similarly, take notes in the document as we discuss discussion/reflection questions but make sure that you go back and clean them up for “public consumption”.

  • 2 You should do this whether you are adding code yourself or using code from our manual, even if it isn’t commented in the manual… especially when the code is already included for you, add comments to describe how the function works/what it does as we introduce it during the participatory coding session so you can refer back to it.

  • Let’s start by loading the packages we will need for this activity.

    # load libraries
    library(knitr)
    library(tidyverse)

    9.1 Producing tidy data sets

    The last set of functions that we need to get comfortable with allow us to create tidy data sets.

    Consider this

    List the three characteristics of a tidy data set. Explain why a tidy data set is sometimes also describe as a long data set.

    Did it!

    [Your answer here]

    Let’s read out data set back into our R session.

    # read catch data
    catch <- read_delim("data/longline_catchdata.txt", delim = "\t")
    Consider this

    Take a look at our data set and argue whether or not it is a tidy data set. The easiest way to do this is to determine if it fullfills all the characteristics.

    Did it!

    [Your answer here]

    Let’s quickly reformat our catch data as follows

    catch_length <- catch %>%
      unite(SetID, Year, Month, Day, Set, sep = "_") %>%
      select(SetID, Site, Species, Sex, PCL, FL, STL)
    
    head(catch_length)
    # A tibble: 6 × 7
      SetID       Site        Species       Sex     PCL    FL   STL
      <chr>       <chr>       <chr>         <chr> <dbl> <dbl> <dbl>
    1 2015_7_28_1 Aransas_Bay Bagre_marinus U        NA   287   353
    2 2015_7_28_1 Aransas_Bay Bagre_marinus U        NA   425   495
    3 2015_7_28_1 Aransas_Bay Bagre_marinus U        NA   416   502
    4 2015_7_28_1 Aransas_Bay Bagre_marinus U        NA   416   507
    5 2015_7_28_1 Aransas_Bay Bagre_marinus U        NA   418   510
    6 2015_7_28_1 Aransas_Bay Bagre_marinus U        NA   434   515

    We can turn this into a tidy data set using pivot_longer(). To do this we have to identify columns that will be used as the key (cols =) and then name the column that will hold those values (names_to()) and the column that will hold the value (values_to()).

    In this case, we have made three observations about length for each specimen, in order to have rows with unique observations we want a column that identifies what type of observation was made, for example called Measurement. This is called the “key” because it allows us to “unlock” what type of measurement the individual observation is, i.e. this column will let us know whether an observation (row) is pre-caudal length, fork length, or stretch total length.

    We will designate another column Length to hold the values for each measurement.

    We can identify the columns that need to be gathered either by name or since we have re-arranged our dataframe so they are the last columns by column number.

    tidy_length <- catch_length %>%
      pivot_longer(names_to = "Measurement", values_to = "Length", cols = 5:7)
    Consider this

    Briefly outline advantages to using tidy data sets.

    Did it!

    [Your answer here]

    With this data set it would be straightforward for us to e.g. calculate mean values for each length measurement by species using group_by() and summarize().

    tidy_length %>%
      group_by(Species, Measurement) %>%
      summarize(mean = mean(Length, na.rm = TRUE))
    # A tibble: 42 × 3
    # Groups:   Species [14]
       Species                 Measurement  mean
       <chr>                   <chr>       <dbl>
     1 Bagre_marinus           FL           433.
     2 Bagre_marinus           PCL          NaN 
     3 Bagre_marinus           STL          517.
     4 Carcharhinus_brevipinna FL           644.
     5 Carcharhinus_brevipinna PCL          583.
     6 Carcharhinus_brevipinna STL          804.
     7 Carcharhinus_leucas     FL           769 
     8 Carcharhinus_leucas     PCL          691.
     9 Carcharhinus_leucas     STL          936.
    10 Carcharhinus_limbatus   FL           613.
    # ℹ 32 more rows

    9.2 Convert a tidy data set to wide format

    Despite all the advantages of tidy data sets you can see from the table above that frequently when we are presenting results in a table it may be advantageous in terms of layout to have a non-tidy format.

    This can be done using pivot_wider() which works like pivot_longer() but in reverse. You designate which column is the key (names_from =), i.e. these will become the column names in the new table. Then you need to identify which column in your current data frame contains the values that should be filled out/spread into the columns that will be generated from your key (values_from =).

    Since we don’t have values for precaudal length, we probably want to use filter() to remove those rows first.

    More notes on naming things … recall, that we said that filenames should not contain spaces or special characters? We set similar rules for naming objects. Well, column names is a similar conundrum. Including spaces or species characters as a column name creates problems when we are using functions like select() to subset by column name or mutate() to create new columns based on exisiting columns. Similarly, if the column name is a number you will have problems. If you do have unconvential column names you can rename them using rename() or you can use backticks and either side of the name to indicate that it is a column name.

    tidy_length %>%
      filter(!Measurement == "PCL") %>%
      group_by(Species, Measurement) %>%
      summarize(mean = mean(Length, na.rm = TRUE)) %>%
      pivot_wider(names_from = "Measurement", values_from = "mean")
    # A tibble: 14 × 3
    # Groups:   Species [14]
       Species                       FL   STL
       <chr>                      <dbl> <dbl>
     1 Bagre_marinus               433.  517.
     2 Carcharhinus_brevipinna     644.  804.
     3 Carcharhinus_leucas         769   936.
     4 Carcharhinus_limbatus       613.  776.
     5 Carcharhinus_porosus        415   475 
     6 Hypanus_americanus          NaN   954.
     7 Hypanus_sabina              NaN   349.
     8 Rhinoptera_bonasus          NaN   819 
     9 Rhizoprionodon_terraenovae  412   510.
    10 Sciades_felis               299.  343.
    11 Sciaenops_ocellatus         793   932.
    12 Sphyrna_lewini              471.  628 
    13 Sphyrna_tiburo              622.  792.
    14 Synodus_foetens             173   185 
    Give it a whirl

    Calculate the number of individuals per species caught per month in 2018 and present that data in a wide formate to make it easy to compare the number of species (species) per month (columns). As a bonus create an additional column with total catch of that species for 2018.

    Did it!

    [Your answer here]